I will explore the relationship between chemical properties and wine quality using the red wine data set. The report covers univariate, bivariate, and multivariate analyses, followed by a final summary and reflection.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The red wine data set contains 1599 observations of 13 variables. 11 of the variables are chemical features.
The range for fixed acidity is 4.60 to 15.90.
The range for volatile acidity is 0.12 to 1.58.
The range for citric acid is 0 to 1.
The median quality is 6 and the mean is 5.636. Quality ranges from 3 to 8.
The median pH is 3.310 and the mean is 3.311. pH varies from 2.740 to 4.010.
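The figures quoted above come from the usual inspection calls. The sketch below rebuilds only the first ten rows of four columns, copied from the `str()` output, so that it runs on its own. The name `wine_sample` and the file name in the comment are assumptions, not taken from the original analysis.

```r
# First ten rows of four columns, copied from the str() output above.
# The full analysis would read the complete file instead, for example:
#   wine <- read.csv("wineQualityReds.csv")   # file name is an assumption
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_sample)      # number of rows and columns
str(wine_sample)      # type and first values of each column
summary(wine_sample)  # min, quartiles, mean and max per column

# Minimum (row 1) and maximum (row 2) of every column in one matrix
ranges <- sapply(wine_sample, range)
ranges
```

On the ten-row sample the ranges are narrower than the full-data ranges reported above, since most of the 1599 observations are missing.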
Examine histogram graphs of all values:
Quality ranges from 3 to 8, with a median of 6 and a mean of 5.636. Most of the quality ratings are either 5 or 6, and the most common rating is 5.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Draw a histogram of the quality ratings:
We create a new categorical variable called “rating” that divides quality into “bad”, “average”, and “excellent”.
**quality < 5 = ‘bad’; 5 ≤ quality < 7 = ‘average’; quality > 6 = ‘excellent’**
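One way to build such a variable is a nested `ifelse()`. This is a minimal sketch on the first ten quality scores from the `str()` output, not necessarily the code used for this report. The explicit level order is my own choice; the summaries printed later in the report list the levels alphabetically.

```r
# First ten quality scores, copied from the str() output
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Apply the three rules in order: < 5 is bad, < 7 is average, the rest excellent
rating <- ifelse(quality < 5, "bad",
                 ifelse(quality < 7, "average", "excellent"))

# Store as a factor so that empty groups still appear in tables
rating <- factor(rating, levels = c("bad", "average", "excellent"))

table(quality)  # counts per raw score
table(rating)   # counts per rating group
```

On the full data set, `table(quality)` produces the score counts shown above, and `table(rating)` gives the group sizes.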
The median alcohol is 10.20 and the mean is 10.42. Alcohol ranges from 8.40 to 14.90. The red wine sample is small, but its alcohol level distribution still shows a clear pattern.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distributions of residual sugar and chlorides are long-tailed, so I transformed both variables to make the shape of each distribution easier to see. A log10 scale produces a more readable distribution for both.
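The effect of the transform can be sketched with base R. The example uses the first ten residual sugar values from the `str()` output and computes the histogram bins without drawing them, so it runs anywhere. The plots in this report may have been produced with a different plotting function.

```r
# First ten residual sugar values, copied from the str() output
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# log10 pulls the long right tail in towards the bulk of the data
log_sugar <- log10(residual.sugar)

summary(residual.sugar)  # the maximum sits far above the median
summary(log_sugar)       # the gap is much smaller on the log scale

# Histogram bins on the log scale; drop plot = FALSE to draw the chart
h <- hist(log_sugar, plot = FALSE)
h$counts
```

The same call with `chlorides` in place of `residual.sugar` applies to the second long-tailed variable.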
There are 1599 red wine observations with 13 variables in the dataset. 11 of the variables are quantitative chemical features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.
The final variable, quality, scores the wine on a scale from 0 to 10, but the observed scores range only from 3 to 8. All of the chemical features have a minimum value greater than 0 except citric acid, whose minimum is 0.
The main feature of interest in the data set is quality. I would like to determine which features are best for predicting the quality of a wine.
Alcohol, fixed acidity, and residual sugar likely contribute to the quality of a wine.
I converted quality into discrete ranges by creating a new variable called “rating”, which is divided into “bad”, “average”, and “excellent”. This grouping makes it easier to detect differences among the groups.
The residual sugar and chlorides histograms did not look normal, so I applied a log transform to the x-axis.
The top 4 correlation coefficients with quality are:
alcohol-quality = 0.48
sulphates-quality = 0.26
citric.acid-quality = 0.22
fixed.acidity-quality = 0.12
Alcohol content has the strongest positive correlation with red wine quality.
The largest negative correlation coefficients with quality are:
volatile.acidity-quality = -0.39
total.sulfur.dioxide-quality = -0.19
density-quality = -0.17
chlorides-quality = -0.13
Variables with the highest positive correlation include:
fixed.acidity-citric.acid = 0.67
fixed.acidity-density = 0.67
free.sulfur.dioxide-total.sulfur.dioxide = 0.67
alcohol-quality = 0.48
sulphates-chlorides = 0.37
Variables with the strongest negative correlation include:
fixed.acidity-pH = -0.68
volatile.acidity-citric.acid = -0.55
citric.acid-pH = -0.54
density-alcohol = -0.50
volatile.acidity-quality = -0.39
The largest negative correlation coefficient with quality belongs to volatile.acidity, and the largest positive one belongs to alcohol. From the plots, quality increases at a moderate rate with higher alcohol, and red wine quality decreases as volatile acidity increases.
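Rankings like the ones above can be obtained from a correlation matrix. The sketch below is an assumption about the approach, run on the first ten rows from the `str()` output, so its coefficients will not match the full-data values quoted in this report. The name `wine_sample` is illustrative.

```r
# First ten rows of four columns, copied from the str() output
wine_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation of every column with quality
cors <- cor(wine_sample)[, "quality"]

# Drop the trivial quality-quality entry and rank the rest
cors <- cors[names(cors) != "quality"]
sort(cors, decreasing = TRUE)
```

Sorting in decreasing order puts the strongest positive coefficients first and the strongest negative ones last, which matches how the two lists above are split.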
## wineQualityReds$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## wineQualityReds$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## wineQualityReds$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
## wineQualityReds$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## wineQualityReds$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## --------------------------------------------------------
## wineQualityReds$rating: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
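The grouped summaries above have the layout that `by()` prints. A minimal sketch of such a call follows, again on the first ten rows from the `str()` output, where no wine is rated bad. The standalone vectors stand in for columns of `wineQualityReds`.

```r
# First ten alcohol and quality values, copied from the str() output
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Same grouping rule as the rating variable defined earlier
rating <- ifelse(quality < 5, "bad",
                 ifelse(quality < 7, "average", "excellent"))

# One summary() per rating group, printed group by group
alcohol_by_rating <- by(alcohol, rating, summary)
alcohol_by_rating
```

Replacing `alcohol` with the volatile acidity column gives the second set of summaries.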
We plot pH against fixed acidity. The correlation coefficient is -0.68, meaning that pH tends to drop as fixed acidity increases, which makes sense because more acid lowers pH.
## [1] -0.6829782
Sulphate content is quite important for red wine quality, particularly at the highest quality level (excellent).
## $average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6473 0.7000 1.9800
##
## $bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4950 0.5600 0.5922 0.6000 2.0000
##
## $excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
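Output with `$average`, `$bad`, and `$excellent` headers is what `tapply()` prints when the applied function returns more than one value per group. The sketch below assumes that approach and uses the first ten rows from the `str()` output, so its numbers say nothing about the full data.

```r
# First ten sulphates and quality values, copied from the str() output
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Same grouping rule as the rating variable defined earlier
rating <- ifelse(quality < 5, "bad",
                 ifelse(quality < 7, "average", "excellent"))

# Full summary per group, returned as a named list
tapply(sulphates, rating, summary)

# A single statistic per group is easier to compare across groups
med <- tapply(sulphates, rating, median)
med
```

On the full data set the group medians are the values in the `Median` column above.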
I was surprised that the strongest relationship is between fixed.acidity and pH.
From the variables analyzed, the strongest relationship was between fixed.acidity and pH, which had a correlation coefficient of -0.68.
Comparing sulphates with alcohol, I noticed that quality increased as sulphates increased.
For excellent wines, alcohol played an important role in determining quality at a given sulphate level.
Higher quality wines have higher alcohol and lower volatile acidity.
Higher quality wines also have higher sulphates and higher citric acid.
There is no clear evidence that sugar content is what makes wines bad.
Across wine quality levels, we see a positive relationship between alcohol and sulphates. We also see a negative relationship between quality and volatile.acidity.
I was surprised that residual sugar has very little impact on wine quality.
This plot shows the distribution of wine quality. The dataset is unbalanced: it has many medium-quality wines but far fewer bad and excellent wines.
In general, high quality wines tend to have higher alcohol and lower volatile acidity content. They also tend to have higher sulphate and higher citric acid content.
Wines with low sulphate content tended to be rated bad, so low sulphate content appears to contribute to bad wines. Average wines have higher concentrations of sulphates than bad wines. Excellent wines have higher alcohol concentrations and higher sulphate concentrations.
The red wine dataset contains information on 1599 red wines with different chemical properties. I set out to discover the relationship between chemical properties and wine quality. Wine quality turned out to be complex, but plots and visuals make it easier to see where to explore further.
The 4 features with the highest correlation coefficients with quality are alcohol, volatile acidity, sulphates, and citric acid. Alcohol content appeared to be the number one factor for determining an excellent wine. Additionally, excellent red wines contain specific amounts of citric acid and sulphates. Volatile acidity has a negative correlation with wine quality, and I was surprised that residual sugar has very little impact on wine quality.
First I worked to understand the individual variables in the data set, and then I explored different questions and leads as I continued to make observations on plots. I identified features that affect the quality of red wine, visualized their relationships, and summarized their statistics. I explored the quality of wines across many variables. Eventually I realised that a good wine is more than a perfect combination of different chemical components.
There are very few wines that are rated as low or high quality. I could do a better analysis if I had more data on wines at the upper and lower ends of the quality scale. More data would likely improve the accuracy of any prediction models. In this exploratory data analysis of the red wine dataset, the biggest challenge was sharing the right amount of information.